Spotify is one of the most popular music streaming platforms globally. The creator of the data set I am using collected the data from Spotify's top songs charts and enriched it with similar data from other music streaming platforms to create a comprehensive dataset of the most popular songs on Spotify. According to the creator, Muhammad Abdullah, the data was collected by using "Pyhton [sic] with beautifulSoup" to perform "web scrapping [sic] using API of Spotify". The dataset itself has been dedicated to the public domain.
This data affects many people, most notably the artists of the songs. From the ethical lens of Positionality, we must consider that everyone, including myself, is biased towards and against different musical artists and genres. It is important to set that bias aside to look at the data objectively. Looking through the lens of Power, we might acknowledge that most of this data comes from Spotify, the most popular music streaming service in the world. Spotify's immense power in the music industry, and by extension this data's power, should be kept in mind as the data analysis is performed.
In this report I will explore questions regarding the popularity of songs and how their traits affect it. Does a song's BPM (beats per minute) influence its rank on the Spotify charts? Do songs in the major mode rank higher than songs in the minor mode? Can a song's rank on the Spotify charts, and by extension its popularity, be predicted by its characteristics? The answers to questions like these could bring about valuable insight regarding the reasons why certain songs become more popular than others.
Firstly, I will import the dataset plus some necessary libraries, perform one small data wrangling step, and explore the data with various visualizations.
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
spotify = pd.read_csv("Spotify Most Streamed Songs.csv")
spotify
| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% | cover_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | ... | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 | Not Found |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | ... | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 | https://i.scdn.co/image/ab67616d0000b2730656d5... |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | ... | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 | https://i.scdn.co/image/ab67616d0000b273e85259... |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | ... | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 | https://i.scdn.co/image/ab67616d0000b273e787cf... |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | ... | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 | https://i.scdn.co/image/ab67616d0000b273ab5c9c... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 948 | My Mind & Me | Selena Gomez | 1 | 2022 | 11 | 3 | 953 | 0 | 91473363 | 61 | ... | A | Major | 60 | 24 | 39 | 57 | 0 | 8 | 3 | https://i.scdn.co/image/ab67616d0000b2730f5397... |
| 949 | Bigger Than The Whole Sky | Taylor Swift | 1 | 2022 | 10 | 21 | 1180 | 0 | 121871870 | 4 | ... | F# | Major | 42 | 7 | 24 | 83 | 1 | 12 | 6 | https://i.scdn.co/image/ab67616d0000b273e0b60c... |
| 950 | A Veces (feat. Feid) | Feid, Paulo Londra | 2 | 2022 | 11 | 3 | 573 | 0 | 73513683 | 2 | ... | C# | Major | 80 | 81 | 67 | 4 | 0 | 8 | 6 | Not Found |
| 951 | En La De Ella | Feid, Sech, Jhayco | 3 | 2022 | 10 | 20 | 1320 | 0 | 133895612 | 29 | ... | C# | Major | 82 | 67 | 77 | 8 | 0 | 12 | 5 | Not Found |
| 952 | Alone | Burna Boy | 1 | 2022 | 11 | 4 | 782 | 2 | 96007391 | 27 | ... | E | Minor | 61 | 32 | 67 | 15 | 0 | 11 | 5 | https://i.scdn.co/image/ab67616d0000b273992a1f... |
953 rows × 25 columns
spotify.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64
 3   released_year         953 non-null    int64
 4   released_month        953 non-null    int64
 5   released_day          953 non-null    int64
 6   in_spotify_playlists  953 non-null    int64
 7   in_spotify_charts     953 non-null    int64
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64
 10  in_apple_charts       953 non-null    int64
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64
 15  key                   858 non-null    object
 16  mode                  953 non-null    object
 17  danceability_%        953 non-null    int64
 18  valence_%             953 non-null    int64
 19  energy_%              953 non-null    int64
 20  acousticness_%        953 non-null    int64
 21  instrumentalness_%    953 non-null    int64
 22  liveness_%            953 non-null    int64
 23  speechiness_%         953 non-null    int64
 24  cover_url             953 non-null    object
dtypes: int64(17), object(8)
memory usage: 186.3+ KB
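In the summary above, streams is stored as object even though it holds stream counts. Should such a column ever be needed as a number, one possible approach is pd.to_numeric with errors="coerce". The sketch below runs on a toy frame; the unparseable entry is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real column; the non-numeric entry is invented.
toy = pd.DataFrame({"streams": ["141381703", "133716286", "not a number"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["streams_num"] = pd.to_numeric(toy["streams"], errors="coerce")

print(toy.dtypes)                       # streams stays object, streams_num is float64
print(toy["streams_num"].isna().sum())  # number of entries that failed to parse
```

Counting the NaN values afterwards shows how many rows would be lost if the converted column were used.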
This dataset has 25 columns and 953 rows. Each row represents one song. The variables that are most important for our purposes are in_spotify_charts (the song's rank on the Spotify charts), bpm (the beats per minute or tempo of the song), and mode (whether the song is in major or minor).
There are a few issues with the data types of some of the variables, but they don't need to be addressed; none of them will affect our data analysis since we won't be using the variables with incorrect data types anyway. However, there is another problem with the in_spotify_charts variable. Let's view each distinct value for this variable to find the problem.
spotify.in_spotify_charts.value_counts()
in_spotify_charts
0 405
4 48
2 42
6 36
3 18
...
76 1
58 1
79 1
66 1
147 1
Name: count, Length: 82, dtype: int64
It seems that there's an overwhelming number of songs in the data that are supposedly ranked 0 on the Spotify charts. However, the ranks on the charts don't go lower (lower numerically, higher functionally) than 1; a rank of 0 simply doesn't exist. My best guess is that these songs made it to the charts on other music streaming services but didn't on the Spotify charts, meaning they have no rank (therefore a rank of 0) on Spotify. Songs that didn't rank on the Spotify charts are useless for our purpose, so we'll filter them out of the data. To remind ourselves that this is filtered data, not the whole dataset, we'll call the filtered data spotify_charters_only.
spotify_charters_only = spotify[spotify.in_spotify_charts != 0]
To get acquainted with the data, let's visualize a variable that will give us a good representation of the variety of the songs in our data: the release year.
alt.Chart(spotify_charters_only, title='Number of Songs on Spotify Charts By Year Released').mark_bar().encode(
x=alt.X('released_year:N', title='Year Released'),
y=alt.Y('count():Q', title='Number of Songs'),
color=alt.Color('released_year:N', legend=None)
).properties(width=400, height=400)
The bar plot above shows the breakdown of release year for all the songs in our dataset. It seems most of the songs were released in the 2000s, with the vast majority releasing in 2022 and 2023. This tells me that newer songs are far more prevalent on the Spotify charts. I would assume that this means newer songs are generally more popular than older songs, but before coming to that conclusion it would be beneficial to see how the ranking on the charts changes as the year released increases.
alt.Chart(spotify_charters_only, title='Trend of Average Rank on Spotify Charts Across Year Released').mark_line().encode(
x=alt.X('released_year:N', title='Year Released'),
y=alt.Y('mean(in_spotify_charts):Q', title='Average Rank on Spotify Charts')
)
The line plot above shows how the average ranking of a song on the Spotify charts changes as the year the song was released increases.
When looking at this chart (and future ones as well), it is important to keep in mind that a lower number for the rank variable means that the song "ranked higher" or was more popular. I will be referring to lower values as higher ranks (and vice versa) from now on.
From the line plot, it seems that the average rank changes depending on the specific year instead of there being a trend over time. The average rank is all over the place until about 2019, when it begins to change less across the years. I believe this is because songs released in those years are more prevalent in the data, meaning that the average is less influenced by each individual song. Either way, it is clear that songs' average ranks do seem to vary depending on the year released, which is a good thing to keep in mind.
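One way to check whether a noisy yearly average simply reflects a small group is to put the group size next to the mean. The sketch below is only an illustration on invented numbers; the column names match the dataset, but the values do not come from it.

```python
import pandas as pd

# Invented toy data: one old song, several recent ones.
toy = pd.DataFrame({
    "released_year": [1985, 2022, 2022, 2022, 2023, 2023],
    "in_spotify_charts": [40, 10, 14, 12, 20, 22],
})

# Mean rank and number of songs per release year, side by side.
summary = (
    toy.groupby("released_year")["in_spotify_charts"]
    .agg(mean_rank="mean", n_songs="count")
    .reset_index()
)
print(summary)
```

A year whose mean rests on a single song (1985 in the toy data) can swing widely, while years with many songs give steadier averages.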
It might be interesting to see if the trend of in_spotify_charts behaves the same way when examined across the month released.
alt.Chart(spotify_charters_only, title='Trend of Average Rank on Spotify Charts Across Month Released').mark_line().encode(
x=alt.X('released_month:N', title='Month Released'),
y=alt.Y('mean(in_spotify_charts):Q', title='Average Rank on Spotify Charts').scale(zero=False)
)
The line plot above shows how the average ranking of a song on the Spotify charts changes as the month the song was released increases. Once again, there seem to be significant changes depending on the month released. Songs released in July appear to rank the lowest on average, whereas the average song released in September ranks higher than the other months. There seems to be a zig-zag pattern where the rank increases from January to April, decreases from April to July, increases from July to September, and decreases from September to January. My only guess for why songs would perform better in September would be that it somehow has to do with back-to-school season, since that's the main thing that happens in September.
Now, let's look at how a song's rank is affected by a different variable. We'll pick the key of the song, since the key is a good indicator of how a song sounds.
alt.Chart(spotify_charters_only, title='Comparing Rank on Spotify Charts for each Key').mark_boxplot(clip=True).encode(
x=alt.X('key:N', title='Key of Song'),
y=alt.Y('in_spotify_charts:Q', title='Rank on Spotify Charts').scale(domain=[0, 100]),
color=alt.Color('key:N', legend=None)
).properties(width=300, height=300)
This is a box plot showing the distribution of the rank on the Spotify charts for songs in each key. It seems that the median rank does vary depending on the key, although this doesn't prove that the variation is caused by the key being different. Songs in a D# key seem to rank highest on the charts with a median rank of 7, whereas songs in a D key seem to rank the lowest with a median rank of 22.5. This could be useful in the future when I pick variables to use as predictors for a song's rank on the Spotify charts.
Next, I want to examine the unique song attributes Spotify collects. Spotify calculates different attributes for their songs such as danceability, valence, energy, etc. Below is a brief description of all their meanings, paraphrased from this website:
Now, let's look at all of them and their relationships with rank on Spotify charts.
attributes = ['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']
alt.Chart(spotify_charters_only, title = 'Relationships Between Attributes and Rank on Spotify Charts').mark_rect().encode(
alt.X(alt.repeat("column"), type='quantitative').bin(maxbins=30),
alt.Y('in_spotify_charts:Q', title='Rank on Spotify Charts').bin(maxbins=30),
color=alt.Color("count():Q")
).properties(
width=250,
height=250
).repeat(
column=attributes
)
These heatmaps show the relationships between various attributes of a song and its rank on the Spotify charts. These charts suggest that danceability and energy have positive relationships with rank on the charts, meaning as they get higher the song ranks lower. Acousticness, liveness, and speechiness all have negative relationships with the rank, meaning as they get higher the song ranks higher. Finally, valence and instrumentalness seem to have no relationship with rank.
This may suggest that the most popular songs are wordy, acoustic, unenergetic, and less suitable for dancing. This is very helpful in giving us a good picture of the type of song that becomes popular.
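Reading the direction of a relationship off a heatmap is somewhat subjective, so a numeric cross-check can help. A minimal sketch, assuming Pearson correlation is an acceptable summary, uses DataFrame.corrwith to correlate each attribute with the rank. The toy values below are invented purely to show the signs.

```python
import pandas as pd

# Invented toy values: rank number rises with danceability, falls with acousticness.
toy = pd.DataFrame({
    "danceability_%": [40, 50, 60, 70, 80],
    "acousticness_%": [60, 45, 30, 20, 5],
    "in_spotify_charts": [5, 12, 20, 31, 44],
})

attributes = ["danceability_%", "acousticness_%"]

# One Pearson correlation per attribute, each against the rank column.
corrs = toy[attributes].corrwith(toy["in_spotify_charts"])
print(corrs)
```

Because a lower rank number means a more popular song, a positive coefficient here corresponds to songs ranking worse as the attribute increases.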
Let's continue looking at how different variables relate to a song's rank on the Spotify charts. We'll pick bpm as our attribute to use, since I would assume the speed of a song would influence its popularity. A good way to test that theory would be a hypothesis test for correlation. First, we'll need a null and alternative hypothesis.
Before we perform the analysis, let's view the relationship with a regression plot.
chart = alt.Chart(spotify_charters_only, title = "Relationship Between Beats Per Minute and Rank on Spotify Charts").mark_point().encode(
x=alt.X("bpm:Q", title = "BPM (Beats Per Minute)").scale(zero=False),
y=alt.Y("in_spotify_charts:Q", title = "Rank on Spotify Charts")
)
chart + chart.transform_regression('bpm', 'in_spotify_charts').mark_line(color="red")
This plot shows the relationship between a song's BPM and its rank on the Spotify charts. Contrary to what I expected, there seems to be no correlation between the two variables, since the scatterplot has no pattern and the regression line is horizontal.
To support this finding, let's look at the correlation coefficient.
corr_BPM_rank = spotify_charters_only.bpm.corr(spotify_charters_only.in_spotify_charts)
corr_BPM_rank
-0.0014270816683671394
The correlation coefficient is -0.001, implying a weak negative relationship between BPM and rank on the Spotify charts. This explains what we saw in the regression plot; the correlation was so weak that it looked non-existent in the visualization.
Although the correlation is extremely weak, we should still run the permutation test to determine if the correlation is real or a product of random chance.
#Run 10,000 permutations using the function we created in class.
def simulate_correlation(df,var1,var2):
    #Shuffle var1, then reset both indexes so the two Series pair up by position.
    #Series.corr aligns on index labels, and the filtered data has gaps in its index.
    shuffled = df[var1].sample(frac=1).reset_index(drop=True)
    corr = shuffled.corr(df[var2].reset_index(drop=True))
    return corr
list = []
for i in range(10000):
    one_perm = simulate_correlation(spotify_charters_only,'bpm', 'in_spotify_charts')
    list.append(one_perm)
list = pd.DataFrame({'list':list})
#Graph the permutation distribution with a red line showing the observed correlation.
alt.data_transformers.disable_max_rows()
histogram = alt.Chart(list).mark_bar().encode(
x=alt.X("list:Q", title="Permutations").bin(maxbins=20),
y=alt.Y("count():Q")
)
list = list.assign(correlation_disp=corr_BPM_rank)
line = alt.Chart(list).mark_rule(color="red", strokeDash=(8,4)).encode(
x=alt.X("correlation_disp")
)
histogram + line
The plot above shows the observed correlation plotted on a histogram of the permutation distribution. The permutations range from about -0.20 to 0.20, and the dotted line is right in the middle of the histogram around 0.0. This tells me that the weak correlation we observed above is not a statistically significant result. However, we need to find the p-value to confirm these results.
p_value = np.mean(list.list > list.correlation_disp)
p_value
0.5065
The p-value is about 0.51, which is far higher than the alpha of .05. This result suggests that there is no statistically significant correlation between BPM and rank on the Spotify charts. While it may be a surprising result, it's useful to know that tempo doesn't seem to factor into a song's popularity.
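The p-value above counts only the shuffles that came out greater than the observed correlation, which is a one-sided comparison. If the alternative hypothesis is simply that some correlation exists in either direction (an assumption on my part), a two-sided version compares absolute values instead. The function name and the synthetic data below are hypothetical and serve only as a sketch.

```python
import numpy as np

def permutation_corr_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffle x repeatedly to break any real pairing with y.
    perm = np.array(
        [np.corrcoef(rng.permutation(x), y)[0, 1] for _ in range(n_perm)]
    )
    # Count shuffles at least as extreme as the observed value, in either direction.
    p_value = float(np.mean(np.abs(perm) >= abs(observed)))
    return observed, p_value

# Synthetic data (invented): y is pure noise, unrelated to x.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.normal(size=200)

observed, p_value = permutation_corr_pvalue(x, y, n_perm=2000)
print(observed, p_value)
```

With an observed correlation very close to zero, both the one-sided and the two-sided version would be expected to give a large p-value, so the conclusion here would likely be the same.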
Next, let's analyze another variable that might influence a song's popularity: the mode of the song. Modes are variations of musical scales, which essentially means that they all give songs different sounds. Songs in major often sound bright and cheerful, whereas songs in minor sound sad or somber. It would be interesting to see if songs in one mode rank higher on average than songs in another mode. To answer that question, we can perform a hypothesis test for comparison of means.
First, we'll need a null and alternative hypothesis.
I hypothesize that songs in major rank higher than songs in minor because it would make sense to me that cheerful songs would be more popular than somber songs. Therefore, in my null hypothesis, I said major would be less than minor because the in_spotify_charts variable is inverted.
To begin, let's plot the difference in distributions between songs in major and songs in minor.
alt.Chart(spotify_charters_only, title = "Comparing the Distribution of Rank on Spotify Charts for Major and Minor Songs").mark_boxplot(clip=True, size=30).encode(
x=alt.X("mode:N", title = "Mode"),
y=alt.Y("in_spotify_charts:Q", title = "Rank on Spotify Charts").scale(domain=[0,100]),
color=alt.Color("mode:N", legend=None)
).properties(height = 250, width = 150)
This box plot shows the difference in the distributions of ranks for songs in major and minor. The median rank for songs in major is 13, and the median for songs in minor is 14. It seems that songs in major might rank higher on average than songs in minor, but the difference is so small that it could very well be nonexistent.
To explore this further, we can make a bar graph to compare the means instead of the medians.
chart = alt.Chart(spotify_charters_only, title = "Comparing the Mean Rank on Spotify Charts for Major and Minor Songs").mark_bar().encode(
y=alt.Y("mode:N", title = "Mode"),
x=alt.X("mean(in_spotify_charts):Q", title = "Rank on Spotify Charts"),
color=alt.Color("mode:N", legend=None)
).properties(height = 100, width = 250)
error_bars = alt.Chart(spotify_charters_only).mark_errorbar(extent='ci').encode(
y=alt.Y("mode:N"),
x=alt.X("mean(in_spotify_charts):Q", title = "Rank on Spotify Charts"),
)
chart + error_bars
This bar plot compares the mean ranks for both categories. Once again, it seems like minor songs rank about 1 place lower on average. However, there's a lot of overlap in the confidence intervals, which suggests that there may actually be no difference.
Before we begin the permutation test, let's make sure that we have a relatively similar number of songs in both modes.
alt.Chart(spotify_charters_only, title = "Comparing the Number of Songs in Each Mode").mark_bar().encode(
y=alt.Y("mode:N", title = "Mode"),
x=alt.X("count():Q", title = "Number of Songs"),
color=alt.Color('mode:N', legend=None)
).properties(height = 100, width = 250)
This is a bar plot comparing the number of songs in major and minor. It seems that there are about 70 more songs in major than songs in minor, but that difference isn't big enough to cause problems in our analysis. Therefore, we can move on to the permutation test.
major_rank = spotify[spotify['mode'] == "Major"].in_spotify_charts
minor_rank = spotify[spotify['mode'] == "Minor"].in_spotify_charts
observed_difference = major_rank.mean() - minor_rank.mean()
observed_difference
-1.4411910669975185
The code above is displaying the difference in means for the two groups. In our data, we are observing that songs in major seem to rank about 1.5 places higher than songs in minor.
def simulate_two_groups(data1, data2):
    n = len(data1) #Get length of first group
    data = pd.concat([data1, data2]) #Get all data
    data = data.sample(frac=1) #Reshuffle all data
    group1 = data.iloc[:n] #Get random first group
    group2 = data.iloc[n:] #Get random second group
    return group1.mean() - group2.mean() #Calculate mean difference
permutations = []
for i in range(10000):
    one_perm = simulate_two_groups(major_rank, minor_rank)
    permutations.append(one_perm)
permutations = pd.DataFrame({'permutations':permutations})
alt.data_transformers.disable_max_rows()
histogram = alt.Chart(permutations).mark_bar().encode(
x=alt.X("permutations:Q", title="Permutations").bin(maxbins=20),
y=alt.Y("count():Q")
)
permutations = permutations.assign(observed_difference=observed_difference) # Add the mean to the dataframe
observed_diff = alt.Chart(permutations).mark_rule(color="red", strokeDash=(8,4)).encode(
x=alt.X("observed_difference", title="Permutations")
)
histogram + observed_diff
The plot above shows the observed difference in means plotted on a histogram of the permutation distribution. It seems that the permutations range about 5 places in both directions. The observed difference is relatively close to the middle of the histogram, suggesting that it is not statistically significant.
p_value = np.mean(permutations.permutations > observed_difference)
p_value
0.869
The p-value is 0.87, which is well above the alpha value of .05. This result suggests that there is no statistically significant difference between the mean rank on Spotify top charts for songs in major and songs in minor. This is surprising to see since I had assumed there would be a clear difference. However, we now have a clearer picture of what factors do and don't affect a song's popularity.
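For a one-sided test, the tail that gets counted should match the direction of the alternative hypothesis. If the alternative is that the mean rank number for major is lower than for minor (my reading of the hypothesis stated earlier, so treat it as an assumption), the matching tail is the share of shuffled differences at or below the observed one. The sketch below uses a hypothetical helper and invented numbers.

```python
import numpy as np

def permutation_mean_diff_pvalue(group_a, group_b, n_perm=10_000, seed=0):
    """One-sided p-value for the alternative mean(group_a) < mean(group_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    # Lower tail: shuffles with a difference at least as negative as observed.
    p_value = float(np.mean(diffs <= observed))
    return observed, p_value

# Invented toy groups with an obvious gap, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [21, 22, 23, 24, 25]

observed, p_value = permutation_mean_diff_pvalue(group_a, group_b, n_perm=2000)
print(observed, p_value)
```

Counting the upper tail instead would answer the opposite question, namely whether the first group's mean is higher than the second's.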
The ultimate goal of this report is to see if we can make a model that can predict a song's rank on the Spotify charts using its characteristics. This will hopefully give us insight on what factors make a song popular. For this task, I have chosen a random forest regressor model. This model was selected through trial and error. Before we can begin modeling, though, we need to address a problem with outliers. Let's look at a histogram of the rank variable.
alt.Chart(spotify_charters_only, title="Distribution of Rank on Spotify Charts").mark_bar().encode(
x=alt.X("in_spotify_charts:Q", title = "Rank on Spotify Charts").bin(maxbins=20),
y=alt.Y("count():Q", title="Count")
)
This histogram shows the distribution of ranks on the Spotify charts for all the songs. It seems the vast majority of the songs are between ranks 1-50, but there are some songs that are ranked all the way out around 140. If we keep these, it'll throw off our model, so we'll filter our data to just focus on songs in the top 50 ranks.
spotify_remove_outliers = spotify[(spotify.in_spotify_charts <= 50) & (spotify.in_spotify_charts != 0)]
spotify_remove_outliers[['track_name', 'in_spotify_charts']].sort_values('in_spotify_charts', ascending = False)
| | track_name | in_spotify_charts |
|---|---|---|
| 6 | Ella Baila Sola | 50 |
| 4 | WHERE SHE GOES | 50 |
| 63 | BESO | 50 |
| 89 | MONTAGEM - FR PUNK | 50 |
| 34 | TQG | 49 |
| ... | ... | ... |
| 756 | Golden | 1 |
| 786 | Un Verano Sin Ti | 1 |
| 851 | Daydreaming | 1 |
| 846 | Keep Driving | 1 |
| 695 | Adore You | 1 |
502 rows × 2 columns
As you can see, we've now filtered the data down to songs in the top 50 ranks, which leaves us with 502 usable samples. Now, we can begin modeling.
First, we'll create a decision tree to pick the best predictors. For our initial predictors, I'll add all of the predictors that we determined to be useful in our data exploration (released year & month, key, all the song attributes minus valence and instrumentalness). I'll also add the number of artists as a predictor. Note that I am not using mode or BPM as predictors since our previous permutation tests have suggested they don't influence a song's rank.
predictors = ['released_year', 'released_month', 'danceability_%', 'energy_%', 'acousticness_%', 'liveness_%', 'speechiness_%', 'key', 'artist_count']
target = 'in_spotify_charts'
reg = tree.DecisionTreeRegressor(
splitter='best',
criterion='poisson',
random_state=42,
max_leaf_nodes=10,
min_samples_leaf=1
)
X = pd.get_dummies(spotify_remove_outliers[predictors], drop_first = True)
y = spotify_remove_outliers[target]
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.3,
random_state=0)
reg = reg.fit(X_train, y_train)
plt.figure(figsize=(20,20))
tree.plot_tree(reg,
feature_names=reg.feature_names_in_,
filled=True,
rounded=True
)
plt.show()
This decision tree originally splits by whether the song was released in/before 2022 or after 2022. For songs released after 2022, the model looks at month released, liveness, and energy. For songs released in/before 2022, it looks at a lot more variables. Just from looking at the branches, it seems that the highest ranking songs are identified by having low energy, low danceability, high liveness, and high acousticness. This is consistent with what we saw in our data exploration when we made the heatmaps.
for f,i in zip(reg.feature_names_in_, reg.feature_importances_):
    print(f,i)
released_year 0.2194345043128173
released_month 0.07305098660090903
danceability_% 0.12488137410428594
energy_% 0.17097702276741425
acousticness_% 0.20621449117282306
liveness_% 0.07525013538403379
speechiness_% 0.0
artist_count 0.0
key_A# 0.0
key_B 0.0
key_C# 0.0
key_D 0.13019148565771652
key_D# 0.0
key_E 0.0
key_F 0.0
key_F# 0.0
key_G 0.0
key_G# 0.0
The list of feature importances suggests that released year, released month, danceability, energy, acousticness, and liveness are the best predictors. After a lot of trial and error, I found that my model performed the best when I removed released month and acousticness from that list, which leaves us with four predictor variables to use for the random forest model.
predictors = ['released_year', 'danceability_%', 'energy_%', 'liveness_%']
target = 'in_spotify_charts'
rf_reg = RandomForestRegressor(
criterion='poisson',
random_state=42,
max_leaf_nodes=7,
min_samples_leaf=25
)
X = pd.get_dummies(spotify_remove_outliers[predictors], drop_first = True)
y = spotify_remove_outliers[target]
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.3,
random_state=0)
rf_reg = rf_reg.fit(X_train, y_train)
for f,i in zip(rf_reg.feature_names_in_, rf_reg.feature_importances_):
    print(f,i)
released_year 0.35791830161104515
danceability_% 0.12932412409250524
energy_% 0.3687851993941776
liveness_% 0.14397237490227202
After narrowing down our predictors, it seems that released year and energy have the highest predicting power, whereas danceability and liveness have a lesser feature importance. However, trial and error has proven that these are the best possible predictors, so I will overlook the fact that the predictors have varying importances.
This is very interesting as it tells us that a song's rank can be predicted just by its release year, danceability, energy, and liveness. I expected the first three to be good predictors, but I had no idea that liveness would be a good predictor as well.
predictions = rf_reg.predict(X_test)
residuals = y_test - predictions
print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")
Root mean squared error: 12.58
The root mean squared error for our random forest regressor is 12.58, meaning our predictions typically miss by roughly 12 to 13 places. I was hoping to get the root mean squared error to under 10 places, but this isn't a bad result.
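An RMSE is easier to judge next to a naive baseline and next to the mean absolute error, since RMSE weights large misses more heavily than a plain average of the errors does. The sketch below uses invented numbers and hypothetical helper names. In practice, the baseline would be the mean rank of the training set rather than of the test values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Invented toy ranks and predictions, for illustration only.
y_true = np.array([3, 10, 25, 40, 12])
y_pred = np.array([8, 12, 20, 30, 15])

# Naive baseline: always predict the mean rank.
baseline = np.full(y_true.shape, y_true.mean())

rmse_model = rmse(y_true, y_pred)
mae_model = mae(y_true, y_pred)
rmse_baseline = rmse(y_true, baseline)
print(rmse_model, mae_model, rmse_baseline)
```

If a model's RMSE is not clearly below the baseline's, the predictors are adding little beyond the average rank.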
results = pd.DataFrame({'Predictions': predictions, 'Residuals':residuals})
alt.Chart(results, title="Histogram of Residuals").mark_bar().encode(
x=alt.X('Residuals:Q', title="Residuals").bin(maxbins=30),
y=alt.Y('count():Q', title="Value Counts")
)
This is a histogram showing the distribution of the residuals. Ideally, I would want to see a normal distribution centered around zero, but this is definitely not normal nor centered around zero. Most of the residuals are negative, meaning the model seems to be predicting that a song will rank more poorly than it actually did.
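The skew described above can also be summarized with two numbers: the mean residual and the share of residuals below zero. A minimal sketch on invented values follows, using the same residual definition as the report (actual minus predicted).

```python
import numpy as np

# Invented toy ranks and predictions; most predictions overshoot the actual rank.
y_true = np.array([3, 10, 25, 40, 12])
y_pred = np.array([9, 14, 31, 38, 15])

# Same definition as in the report: actual minus predicted.
residuals = y_true - y_pred

mean_residual = float(residuals.mean())          # below zero: predictions too high on average
share_negative = float(np.mean(residuals < 0))   # fraction of songs ranked better than predicted
print(mean_residual, share_negative)
```

A mean residual near zero with a share near one half would indicate an unbiased model; values far from those point to systematic over- or under-prediction.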
My hypothesis for this phenomenon is that there's a very important variable affecting a song's rank that we don't have in the dataset: star power. If a song comes from a very popular artist, it will almost definitely perform better than a similar song from a lesser-known artist. However, since we don't have a variable about artist popularity, the model can't factor that into its predictions. I think that might be the reason why the model is predicting too low.
In this report, we discovered a lot about what attributes of a song affect its popularity. From our EDA and decision tree, we discovered that the most popular songs have low energy, low danceability, high liveness, and high acousticness. We also learned that attributes like the tempo and mode don't seem to make a song more or less popular. Additionally, we discovered that the Spotify top charts have a heavy bias towards songs released in or after 2022, meaning newer songs are often the most popular. These findings give us insight into the musical preferences of the majority and what the people value most in a song.
Using these findings, we were able to create a random forest regression model that could predict a song's rank on the Spotify charts, although its predictions were off by about 12 places on average. With more variables and data, I imagine it's possible to create a much more accurate model. In fact, a user called tortmedovik analyzed the same dataset with a DecisionTreeRegressor and was able to achieve a mean squared error that was essentially zero. Still, our model performed well and taught us a lot about what aspects of a song make it popular. Overall, we successfully answered all of our research questions.
“Spotify Prediction.” Kaggle, Kaggle, 22 Oct. 2024, www.kaggle.com/code/tortmedovik/spotify-prediction.